[57] ABSTRACT

A speech recognition system is extensible in that new terms 
may be added to a list of terms that are recognized by the 
speech recognition system. The speech recognition system 
provides audio feedback when new terms are added so that 
a user may hear how the system expects the word to be 
pronounced. The user may then accept the pronunciation or 
provide his own pronunciation. The user may also selec- 
tively change the pronunciation of words to avoid misrec- 
ognitions by the system. The system may provide appropri- 
ate user interface elements for enabling a user to change the 
pronunciation of words. The system may also include intel- 
ligence for automatically changing the pronunciation of 
words used in recognition based upon empirically derived 
information. 
EXTENSIBLE SPEECH RECOGNITION SYSTEM THAT PROVIDES A USER WITH AUDIO FEEDBACK

TECHNICAL FIELD

The present invention relates generally to data processing systems and, more particularly, to an extensible speech recognition system that provides a user with audio feedback.

BACKGROUND OF THE INVENTION

Speech recognition systems enable a computer system to understand at least selected portions of speech that are input to the computer system. In general, speech recognition systems parse input speech into workable segments that may be readily recognized. For example, input speech may be parsed into phonemes that are further processed to recognize the content of the speech. Typically, speech recognition systems recognize words in input speech by comparing the pronunciation of the word in the input speech with patterns or templates that are stored by the speech recognition system. The templates are produced using phonetic representations of the word and context-dependent templates for the phonemes. Many speech recognition systems include dictionaries that specify the pronunciations of terms that are recognized by the speech recognition system.

One place in which speech recognition systems are used is in dictation systems. Dictation systems convert input speech into text. In such dictation systems, the speech recognition systems are used to identify words in the input speech, and the dictation systems produce textual output corresponding to the identified words. Unfortunately, these dictation systems often experience a high level of misrecognition of speech input from certain users. The speech recognition systems employed within such dictation systems have one or more pronunciations for each word, but the pronunciations of the words are static and represent the pronunciation that the speech recognition system expects to hear. If a user employs a different pronunciation for a word than that expected by the speech recognition system, the speech recognition system will often fail to recognize the user's input. This drawback can be especially vexing to a user when a term has multiple proper pronunciations and the user employs one of the pronunciations that is not covered by the dictionary of the speech recognition system.

Another limitation of such dictation systems is that they are either not extensible (i.e., a user may not add a new term to the dictionary) or they permit the addition of new terms but generate their own pronunciation of the new term without allowing the user to discover the pronunciation(s). Such systems may use letter-to-sound heuristics to guess at the pronunciation of a newly added term. Unfortunately, such heuristics do not yield correct results in many instances. Oftentimes, when a user adds a new term to extend the dictionary used in a dictation system, the user merely enters the new term without providing a pronunciation, and the speech recognition system generates a pronunciation for the new term. This new pronunciation may be incorrect or may not correspond with the user's anticipated pronunciation of the word. As a result, there is often a high degree of misrecognition relative to speech input that uses or includes the newly added term.

SUMMARY OF THE INVENTION

The above-described limitations of the prior art are overcome by the present invention. In accordance with a first aspect of the present invention, a method is practiced by a computer-implemented speech recognition system that recognizes speech input from a speaker. In accordance with this method, a text-to-speech mechanism is provided for creating a spoken version of text. The text-to-speech mechanism is utilized to generate a spoken version of a given word, and the spoken version of the given word is output on the audio output device so that a user of the speech recognition system knows how the speech recognition system expects the given word to be pronounced. The text-to-speech mechanism generates a pronunciation for the given word which corresponds with the pronunciation that the speech recognition system expects to hear for the given word. In particular, the text-to-speech mechanism may share the same letter-to-sound component with the speech recognition system so that the pronunciation of the spoken version of the given word generated by the text-to-speech mechanism is identical to the pronunciation that the speech recognition system expects to hear.

In accordance with another aspect of the present invention, a list of pronunciations for words that are recognized by a dictation system is provided. A request is received from a user to change a current pronunciation of a selected word that is stored in the list to a new pronunciation. The request specifies the new pronunciation. The pronunciation that is stored in the list for the selected word is changed from the current pronunciation to the new pronunciation in response to the request.

In accordance with a further aspect of the present invention, a method is performed by a computer-implemented speech recognition system. A dictionary of terms that the speech recognition system recognizes is provided, and the dictionary specifies how the speech recognition system expects each term to be pronounced. A request is received from a user to add a new term to the dictionary, and a pronunciation for the new term is generated by the speech recognition system. The pronunciation of the new term is output on an audio output device using the text-to-speech mechanism (with the speech recognition system's expected pronunciation for the new term as input), and the new term as well as the generated pronunciation are added to the dictionary.

In accordance with yet another aspect of the present invention, multiple pronunciations for a selected term are stored in a dictionary of a speech recognition system. Each of the pronunciations for the selected term is output on the audio output device so that a user can hear the pronunciation. In response to a user selecting one of the pronunciations, the selected pronunciation is used by the speech recognition system to recognize speech.

In accordance with another aspect of the present invention, a dictionary of terms having pronunciations for each term is provided. The pronunciations correspond with how a speech recognition system expects the terms to be pronounced. In multiple instances where the speaker speaks a selected one of the terms so that the speech recognition system recognizes the selected term, the specific pronunciation of the selected term that the user used is determined. Based on this repeated determination, the system identifies which of the alternative pronunciations of the selected term the user is most likely using and updates the dictionary to designate that pronunciation as the pronunciation that the speech recognition system expects.

In accordance with a further aspect of the present invention, the spoken version of a term that has a given pronunciation is received from a speaker. An expected
pronunciation for the term is provided. The expected pronunciation corresponds to how the speech recognition system expects the speaker to speak the term. The given pronunciation of the spoken version of the term is compared with the expected pronunciation to determine the degree of difference. Where the degree of difference exceeds an acceptable predetermined threshold, output is generated on an output device to inform the speaker that the degree of difference exceeds the threshold. The output may also include an expected pronunciation of the term generated by the text-to-speech mechanism.

In accordance with an additional aspect of the present invention, a computer-implemented speech recognition system recognizes spoken speech from a speaker. An expected pronunciation is provided for a given word. The expected pronunciation constitutes how the speech recognition system expects the given word to be pronounced by the speaker. Statistics are gathered regarding how frequently the given word, as spoken by the speaker, is misrecognized by the speech recognition system. Where the statistics indicate that the given word is misrecognized more frequently than a threshold value, the user is prompted, by generating output on the display device, to correct the expected pronunciation of the given word.

BRIEF DESCRIPTION OF THE DRAWINGS

A preferred embodiment of the present invention will be described in more detail below relative to the following figures.

FIG. 1 is a block diagram of a computer system that is suitable for practicing the preferred embodiment of the present invention.

FIG. 2 is a block diagram that illustrates components of the speech recognition system.

FIG. 3 is a diagram that illustrates an example portion of the dictionary.

FIG. 4 is a flowchart illustrating the steps that are performed to enable a user to change the pronunciation of a term in the dictionary.

FIGS. 5A-5D depict user interface elements that are provided to enable a user to alter the pronunciation of a term of the dictionary.

FIG. 6 is a flowchart illustrating the steps that are performed to add a new term to the dictionary.

FIGS. 7A-7D illustrate user interface elements that are provided to enable a user to add a new term with a given pronunciation to the dictionary.

FIG. 8 is a flowchart illustrating the steps that are performed to alter the pronunciations stored within the dictionary without interactive user input.

DETAILED DESCRIPTION OF THE INVENTION

The preferred embodiment of the present invention provides an extensible speech recognition system that provides a user with audio feedback. Thus, when a user seeks to add a word or term to a dictionary used by the speech recognition system, the user is provided with audio feedback that identifies how the system believes the term should be pronounced. The user may then accept this pronunciation or request that the pronunciation be changed. In one alternative, the user specifies the pronunciation of the word by spelling out how the word should sound. In another alternative, the system provides the user with a list of alternative pronunciations for the word and the user chooses one of them.

The preferred embodiment of the present invention enables a user to change the pronunciation of any of the words that are currently in the dictionary. In addition, the preferred embodiment of the present invention enables a user to hear the pronunciation currently assigned to a word within the dictionary. The user may hear the pronunciation and then change it to an alternative pronunciation, if desired. As a result, the user may greatly enhance the degree of recognition realized by the speech recognition system.

In an alternative embodiment, the dictionary used by the speech recognition system is automatically and transparently updated by the system. The system processes a corpus of pronunciation data to derive alternative pronunciations for terms within the dictionary. When the user speaks a term and the term is properly recognized, the system identifies which of the alternative pronunciations the user spoke. After a fixed number of proper recognitions and comparisons, the system obtains a degree of confidence in the pronunciation that the user is speaking and changes the dictionary (including propagating the change in observed entries to other unobserved entries in a systematic way) to utilize that pronunciation so as to enhance the degree of recognition realized by the system.

The preferred embodiment of the present invention will be described below relative to a dictation system. In the dictation system, the user speaks into an audio input device, such as a microphone, to enter spoken text. The dictation system recognizes the spoken text and produces corresponding text as part of a document. Those skilled in the art will appreciate that the present invention may also be practiced in contexts other than a dictation system. The present invention applies more generally to speech recognition systems.

FIG. 1 is a block diagram of a computer system 10 that is suitable for practicing the preferred embodiment of the present invention. The computer system 10 includes a central processing unit (CPU) 12 that has access to a primary memory 14 and secondary storage 16. The secondary storage 16 may include removable media drives, such as a CD-ROM drive, which are capable of reading information stored on a computer-readable medium (e.g., a CD-ROM). The computer system 10 also includes a number of peripheral devices. These peripheral devices may include, for instance, a keyboard 18, a mouse 20, a video display 22, an audio loudspeaker 24, and a microphone 26. The computer system may additionally include a modem 28, a sound card 29, and a network adapter 30 that enables the computer system to interface with a network 32. The memory 14 holds program instructions and data for the dictation system 34. The instructions are run on the CPU 12 to realize the preferred embodiment of the present invention. The dictation system 34 may be used by application programs 35, such as word processing programs and messaging programs. The dictation system includes a speech recognition system 36.

Those skilled in the art will appreciate that the computer system configuration depicted in FIG. 1 is intended to be merely illustrative and not limiting of the present invention. The present invention may also be practiced with alternative computer system configurations, including multiprocessor systems and distributed systems. For purposes of the discussion below, it is assumed that the steps that are performed by the preferred embodiment of the present invention are at the direction of the dictation system 34 or the speech recognition system 36.

A suitable speech recognition system for practicing the preferred embodiment of the present invention is described
in copending application entitled "Method and System for Speech Recognition Using Continuous Density Hidden Markov Models," application Ser. No. 08/655,273, which was filed on May 1, 1996, which is assigned to a common assignee with the present application, and which is explicitly incorporated by reference herein. FIG. 2 depicts the components of the speech recognition system 36 that are of particular interest to the discussion of the preferred embodiment of the present invention. The speech recognition system 36 includes a speech recognition engine 40 that utilizes a dictionary 42 and letter-to-sound rules 46. The dictionary holds a list of the terms that are recognized by the speech recognition engine 40 and the associated pronunciations. FIG. 3 depicts an example of a portion of the dictionary 42. Each entry within the dictionary 42 includes a field 50 for identifying the associated term and a field 52 for specifying the pronunciation of the term. FIG. 3 shows an example of an entry for the term "Add." The identity of the term is held within field 54 and the pronunciation of the term is held in field 56. The pronunciation of the term is specified in terms of phonemes.

The speech recognition system 36 may also include a text-to-speech engine 44 for converting text into spoken output. The text-to-speech engine 44 has access to the dictionary 42 and to the letter-to-sound rules 46 that convert textual letters into corresponding sounds. The text-to-speech engine 44 first uses the dictionary 42 to locate pronunciations and then resorts to using the letter-to-sound rules 46 when the word being processed is not in the dictionary. Those skilled in the art will appreciate that the text-to-speech engine 44 need not be part of the speech recognition system, but rather may be part of a separate speech synthesis unit. Nevertheless, for purposes of the discussion below, it is assumed that the text-to-speech engine 44 is part of the speech recognition system 36. A suitable text-to-speech system is discussed in the copending application entitled "Method and System of Run Time Acoustic Unit Selection for Speech Synthesis," application Ser. No. 08/648,808, which was filed on Apr. 30, 1996, which is assigned to a common assignee with the present application, and which is explicitly incorporated by reference herein. Those skilled in the art will further appreciate that the speech recognition engine 40 and the text-to-speech engine 44 may have their own respective dictionaries and letter-to-sound rules.

FIG. 4 is a flow chart that illustrates the steps that are performed by the speech recognition system 36 to enable a user to change the pronunciation of a term that is stored within the dictionary 42. First, the user requests to hear the pronunciation of a given word (step 60 in FIG. 4). The user then identifies the term for which he wishes to hear the pronunciation (step 62 in FIG. 4). FIG. 5A shows an example of a user interface element 78 that is displayed when the user makes a request to hear the pronunciation of a word. The user interface element 78 includes a list 80 of alternatives for a spoken word. In the example shown in FIG. 5A, the words are organized alphabetically. The user may move through the list 80 to select the desired word. In the example depicted in FIG. 5A, the user has selected the word "orange," which appears within the selection box 82. The user may then hear the pronunciation of the selected word (step 64 in FIG. 4) by activating button 84. A suitable means for activating the button 84 is to position a mouse cursor 85 on the button 84 and click a mouse button while the mouse cursor points at the button 84.

The user hears the pronunciation of the word and can then make a determination whether the pronunciation is correct. The output pronunciation is the default pronunciation that is utilized by the speech recognition system 36. If the user accepts the pronunciation (see step 66 in FIG. 4), the user may activate the "OK" button 86. On the other hand, if the user is not happy with the pronunciation (see step 66 in FIG. 4), the user may activate the "Change" button 87. In this fashion, the user requests the change of the pronunciation of the selected term (step 68 in FIG. 4).

The user then identifies a new pronunciation for the selected term (step 68 in FIG. 4). FIG. 5B shows a first alternative, in which the system provides multiple alternative pronunciations for the term and the user selects one of the pronunciations. In particular, as shown in FIG. 5B, a user interface element 88 is provided that asks users to select a pronunciation from one of the pronunciations listed in the list 90. The user may cancel the process of changing the pronunciation by activating the "Cancel" button 94, or may select one of the pronunciations within the list and hit the "OK" button to accept the selected pronunciation as the new default pronunciation for the term.

Those skilled in the art will appreciate that there may be multiple ways of generating the alternative pronunciations for the term. [...] multiple pronunciations for each term that is stored in the dictionary. [...] the system may be [...] for each term [...] pronunciations. Still further, the multiple pronunciations may be derived empirically from different pronunciations [...] the speech recognition system.

A second alternative is depicted in FIG. 5C. In the second alternative, the system does not provide the alternative pronunciation; rather, the user enters the alternative pronunciation. A user interface element 96 like that depicted in FIG. 5C is displayed, and the user spells out the new pronunciation in text box 98. The user need not enter the phonemes for the pronunciation but rather need only enter a sequence of letters (i.e., a text string) that captures the desired pronunciation of the word. For example, if the user desires to spell out the pronunciation of the word orange, the user might enter the string "ornj." The user may then hear how the system interprets the string that was entered in the text box 98 by activating button 100. The speech recognition system processes the text string that was entered in the text box 98 using the letter-to-sound rules and the dictionary. If the user is satisfied with the resulting output pronunciation of the term, the user may activate the "OK" button 102. If the user does not wish to change the pronunciation, the user may activate the "Cancel" button 104. If the user is not satisfied with the resulting pronunciation but wishes to attempt to enter another pronunciation, the user types the alternative pronunciation in the text box 98 and repeats the process.

Those skilled in the art will appreciate that other alternatives may be used. For example, pronunciations need not be presented to the user as selectable text strings (as in the first alternative), but rather may be associated with particular user interface elements, such as buttons, that the user may activate to hear alternative pronunciations. FIG. 5D shows an example where buttons 93 are displayed and each button is activatable to produce audio output for a separate pronunciation.

After the user has identified an acceptable new pronunciation (i.e., step 68 in FIG. 4), the system needs to update the dictionary accordingly. Specifically, the system replaces the pronunciation of the term within the dictionary with the newly identified pronunciation that is satisfactory to the user (step 70 in FIG. 4). Also, the system may propagate the change of the specific term to other terms in a systematic manner. For example, if a user pronounces "what" as "HH W AH T," then the change may be propagated to all words beginning with "wh" (e.g., "where" and "which"). Those skilled in the art will appreciate that, in alternative embodiments, the dictionary may hold multiple pronunciations and have a single pronunciation as the default pronunciation. In such alternative embodiments, the change of pronunciation is merely a change in the default pronunciation that is utilized by the speech recognition system 36.
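The dictionary layout of FIG. 3 (a term field plus a pronunciation field), the default-pronunciation variant, and the "wh" propagation example can be sketched in Python. This is a minimal illustration under stated assumptions: the `DictionaryEntry` class, the `propagate_wh_change` function, and the ARPAbet-style phoneme strings for "where," "which," and "water" are hypothetical, and the propagation rule (prefix "HH" to any "wh" word whose default pronunciation begins with "W") is only one possible reading of the "what" example.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    """Hypothetical entry: a term and its pronunciations; index 0 is the default.

    Assumes every entry holds at least one pronunciation.
    """
    term: str
    pronunciations: list = field(default_factory=list)  # list of phoneme tuples

    @property
    def default(self):
        return self.pronunciations[0]

    def set_default(self, pron):
        # A change of pronunciation only reorders the alternatives: the chosen
        # one moves to the front (it is stored first if it was not present).
        if pron in self.pronunciations:
            self.pronunciations.remove(pron)
        self.pronunciations.insert(0, pron)


def propagate_wh_change(dictionary):
    """Assumed propagation rule: every 'wh' term whose default pronunciation
    starts with W gets 'HH W ...' as its new default. Returns changed terms."""
    changed = []
    for term, entry in dictionary.items():
        pron = entry.default
        if term.startswith("wh") and pron[0] == "W":
            entry.set_default(("HH",) + pron)
            changed.append(term)
    return changed


dictionary = {
    "what": DictionaryEntry("what", [("W", "AH", "T")]),
    "where": DictionaryEntry("where", [("W", "EH", "R")]),
    "which": DictionaryEntry("which", [("W", "IH", "CH")]),
    "water": DictionaryEntry("water", [("W", "AO", "T", "ER")]),
}
# The user changes "what" to "HH W AH T" (step 70); the change is then propagated.
dictionary["what"].set_default(("HH", "W", "AH", "T"))
changed = propagate_wh_change(dictionary)
print(changed)  # ['where', 'which']
```

Keeping the old pronunciation as a non-default alternative, rather than overwriting it, corresponds to the alternative embodiment in which a change of pronunciation is merely a change of the default.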

The speech recognition system of the preferred embodiment of the present invention is extensible in that new terms may be added to the dictionary. FIG. 6 is a flow chart illustrating the steps that are performed when a user desires to add a new term to the dictionary. First, the user requests to add a new term to the dictionary (step 110 in FIG. 6). FIG. 7A shows an example of one user interface mechanism that may be provided by the dictation system 34 to enable a user to add a term to the dictionary 42. FIG. 7A depicts a window 126 of an application program that utilizes the dictation system 34. The window 126 includes a menu bar 132 on which an Edit menu header 134 is included. When the user positions a mouse cursor over the Edit menu header 134 and depresses a mouse button, the menu 136 is displayed, which includes a menu item 138 for adding a new term to the dictionary. A user may select the item 138 by positioning the mouse cursor to point at the item 138 and lifting the mouse button, or by clicking on the item. The window 126 holds text 126 that is produced by the dictation system 34, which uses the speech recognition system 36, by interpreting input that the user has spoken into the microphone 26. The current cursor position 130 is indicated in FIG. 7A.

After the user selects menu item 138, a dialog box 140, like that depicted in FIG. 7B, is displayed. This dialog box 140 asks the user to enter the text for the term that the user wishes to add to the dictionary. A text box 142 is provided within the dialog box 140. After the user enters the text, the user may continue the process of adding the new term by pressing the "OK" button 144 or may terminate the process by pressing the "Cancel" button 146. Hence, the user provides the text for the term that is to be added to the dictionary (step 112 of FIG. 6). The dictation system 34 passes the text on to the speech recognition system 36. The speech recognition system provides the text to the dictionary 42 and the letter-to-sound rules 46 to generate a pronunciation for the new term (step 114 in FIG. 6). The resulting pronunciation is then output over the audio loudspeaker 24 to the user (step 116 in FIG. 6) so that the user can appreciate how the speech recognition system 36 expects the term to be pronounced. A user interface element 150 like that depicted in FIG. 7C may then be displayed to enable the user to accept or reject the pronunciation. In the example depicted in FIG. 7C, the user interface element 150 asks the user whether he accepts the pronunciation of the new term and includes a "Yes" button 152 for accepting the pronunciation, a "No" button 154 for rejecting the pronunciation, and an audio output button 153 for generating audio output for the pronunciation of the new term. By activating these buttons, the user accepts or rejects the pronunciation produced by the text-to-speech engine 44 (see step 118 in FIG. 6).
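Step 114 of FIG. 6, like the text-to-speech engine 44 generally, consults the dictionary first and falls back on letter-to-sound rules only for unknown words. The following is a minimal Python sketch of that lookup order. The rule table is a toy, hypothetical grapheme-to-phoneme mapping applied by greedy longest match, not the letter-to-sound rules 46 themselves, and the phonemes given for "add" are an assumption (the text does not list the contents of field 56).

```python
# Hypothetical, deliberately tiny letter-to-sound table: grapheme -> phonemes.
LTS_RULES = {
    "ph": ("F",), "sh": ("SH",), "ch": ("CH",), "th": ("TH",), "ee": ("IY",),
    "a": ("AE",), "b": ("B",), "c": ("K",), "d": ("D",), "e": ("EH",),
    "f": ("F",), "g": ("G",), "h": ("HH",), "i": ("IH",), "j": ("JH",),
    "k": ("K",), "l": ("L",), "m": ("M",), "n": ("N",), "o": ("AA",),
    "p": ("P",), "q": ("K",), "r": ("R",), "s": ("S",), "t": ("T",),
    "u": ("AH",), "v": ("V",), "w": ("W",), "x": ("K", "S"), "y": ("Y",),
    "z": ("Z",),
}
MAX_GRAPHEME = max(len(g) for g in LTS_RULES)


def letter_to_sound(text):
    """Convert letters to phonemes by greedy longest match against LTS_RULES."""
    word = text.lower()
    phonemes = []
    i = 0
    while i < len(word):
        for size in range(min(MAX_GRAPHEME, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in LTS_RULES:
                phonemes.extend(LTS_RULES[chunk])
                i += size
                break
        else:
            i += 1  # skip characters the rules do not cover (digits, hyphens, ...)
    return tuple(phonemes)


def generate_pronunciation(term, dictionary):
    """Dictionary lookup first; letter-to-sound rules only as a fallback."""
    pron = dictionary.get(term.lower())
    if pron is not None:
        return pron, "dictionary"
    return letter_to_sound(term), "letter-to-sound"


dictionary = {"add": ("AE", "D")}  # assumed phonemes for the FIG. 3 entry "Add"
print(generate_pronunciation("Add", dictionary))   # (('AE', 'D'), 'dictionary')
print(generate_pronunciation("Shep", dictionary))  # (('SH', 'EH', 'P'), 'letter-to-sound')
```

A table this small also shows why such heuristics often guess wrong: `letter_to_sound("phone")` yields `('F', 'AA', 'N', 'EH')`, sounding the silent final "e," which is exactly the kind of error the audio output of step 116 lets the user catch before the term is added.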

Where the user accepts the pronunciation, the term and the associated pronunciation are added to the dictionary 42 (step 120 in FIG. 6). The associated pronunciation will be used in recognizing future spoken instances of the term. If, however, the user does not accept the pronunciation, the system then prompts the user for the pronunciation (step 122 in FIG. 6). The speech recognition system 36 realizes that the pronunciation produced by the dictionary 42 and letter-to-sound rules 46 was not acceptable to the user and, thus, asks the user to produce a representation of the pronunciation for the word. The speech recognition system 36 displays a dialog box 160 like that depicted in FIG. 7D to request the user to enter the pronunciation for the new term. The dialog box 160 includes a text box 162 in which a user may enter a text string that spells out how the new term should sound. After the user has entered text into the text box 162, the user may activate button 164 to hear how the system interprets the text that is entered in the text box. Specifically, the system generates a spoken representation of the pronunciation entered in the text box 162 that is output over the loudspeaker 24. Once the user has entered a text string that produces an acceptable pronunciation, the user may change the pronunciation by activating the "OK" button 166. The user may also cancel the change in the pronunciation by activating the "Cancel" button 168. In general, the system will prompt the user for a pronunciation (step 122), receive the pronunciation entered by the user (step 124), and output the pronunciation that has been received from the user until the user accepts the resulting pronunciation. Alternatively, the system may compare the current pronunciation with the newly added pronunciation entered by the user and, if the two are close enough, not prompt the user again to accept or reject it.
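The loop of steps 122, 124, and 116, including the optional "close enough" shortcut, might look like the following Python sketch. The helper names and callbacks are hypothetical stand-ins for dialog box 160 and the loudspeaker. "Close enough" is assumed to mean a phoneme edit (Levenshtein) distance no greater than a small threshold, a measure the text does not specify, and "current pronunciation" is read as the pronunciation the system originally generated.

```python
def phoneme_distance(a, b):
    """Levenshtein (edit) distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete pa
                           cur[j - 1] + 1,             # insert pb
                           prev[j - 1] + (pa != pb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


def confirm_pronunciation(current, read_entry, to_phonemes, play, accepts, close_enough=1):
    """Repeat steps 122/124/116 until a pronunciation is accepted.

    current:     the pronunciation the system generated (and the user rejected)
    read_entry:  returns the text string the user typed (hypothetical callback)
    to_phonemes: converts that string to phonemes (letter-to-sound rules)
    play:        outputs a pronunciation over the loudspeaker
    accepts:     returns True if the user accepts what was played
    """
    while True:
        candidate = to_phonemes(read_entry())                     # steps 122 and 124
        if phoneme_distance(current, candidate) <= close_enough:  # optional shortcut
            return candidate                                      # close enough: do not ask again
        play(candidate)                                           # step 116
        if accepts(candidate):
            return candidate


# Hypothetical letter-to-sound results for two spellings a user might type.
toy_lts = {"ornj": ("AO", "R", "N", "JH"), "oranj": ("AO", "R", "AE", "N", "JH")}
generated = ("AO", "R", "IH", "N", "JH")

played = []
entries = iter(["ornj", "oranj"])
answers = iter([False, True])  # the user rejects the first entry and accepts the second
result = confirm_pronunciation(generated, lambda: next(entries), toy_lts.get,
                               played.append, lambda pron: next(answers),
                               close_enough=0)
print(result)  # ('AO', 'R', 'AE', 'N', 'JH')
```

With `close_enough=0` every entry is played back and confirmed; with the default of 1, typing "ornj" would be taken without a further prompt, since it differs from the generated pronunciation by a single deleted phoneme.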

The system may also transparently update the pronunciations stored within the dictionary 42 without explicit user request. This mechanism may be utilized independently of the above-described user-initiated approach to updating the pronunciation of a term stored within the dictionary 42, or in conjunction with that approach. Initially, the system is provided with a corpus of pronunciation data, and the system applies an algorithm such as a classification and regression tree ("CART") algorithm to derive alternative pronunciations for the associated terms (step 170 in FIG. 8). CART algorithms are well known in the art and are described in numerous publications, including Breiman et al., Classification and Regression Trees, 1984. Those skilled in the art will appreciate that other heuristics may be applied to derive the pronunciations. The derived alternative pronunciations are stored for later use. When a user speaks a term and the term is recognized, the system compares how the user spoke the term with the alternative pronunciations stored for the term (step 172 in FIG. 8). This process is repeated (see the return arrow to step 172 in FIG. 8) until the system is confident that it can accurately identify which of the alternative pronunciations the user is using (see step 174 in FIG. 8). The system may, for example, require that a desired number of hits for one of the alternative pronunciations be received before the system reaches a level of confidence sufficient to identify that pronunciation as the pronunciation that the user is using. The speech recognition system 36 then changes the dictionary 42 to use the pronunciation favored by the user (i.e., the pronunciation that the system identified as that being used by the user) (step 176 in FIG. 8).
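Steps 172 through 176 amount to counting, per term, which stored alternative each correctly recognized utterance matches, and switching the default once one alternative reaches a required number of hits. A hedged Python sketch follows: the CART derivation of step 170 is not shown (the alternatives are simply given), `PronunciationAdapter` and `required_hits` are hypothetical names, and matching a spoken phoneme sequence to the most similar alternative with the standard library's `difflib.SequenceMatcher` is an assumed stand-in for whatever comparison the recognizer actually performs.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


class PronunciationAdapter:
    """Hypothetical sketch of FIG. 8, steps 172-176."""

    def __init__(self, alternatives, defaults, required_hits=3):
        self.alternatives = alternatives    # term -> list of phoneme tuples (e.g. from step 170)
        self.defaults = defaults            # term -> phoneme tuple the recognizer currently expects
        self.required_hits = required_hits  # hits needed before the system is "confident"
        self.hits = defaultdict(Counter)    # term -> how often each alternative was matched

    def closest_alternative(self, term, spoken):
        # Step 172: compare the spoken form with every stored alternative.
        return max(self.alternatives[term],
                   key=lambda alt: SequenceMatcher(None, alt, spoken).ratio())

    def observe(self, term, spoken):
        """Call after `term` was correctly recognized; returns True if the default changed."""
        alt = self.closest_alternative(term, spoken)
        self.hits[term][alt] += 1
        confident = self.hits[term][alt] >= self.required_hits  # step 174
        if confident and self.defaults[term] != alt:
            self.defaults[term] = alt                           # step 176
            return True
        return False


alts = {"tomato": [("T", "AH", "M", "EY", "T", "OW"), ("T", "AH", "M", "AA", "T", "OW")]}
adapter = PronunciationAdapter(alts, {"tomato": alts["tomato"][0]}, required_hits=3)
spoken = ("T", "AH", "M", "AA", "T", "OW")
print([adapter.observe("tomato", spoken) for _ in range(4)])  # [False, False, True, False]
```

Requiring several hits, rather than acting on a single utterance, keeps one atypical utterance from rewriting the dictionary. Propagating the change to unobserved entries, as described for the user-initiated case, would be a separate step.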

While the present invention has been described with reference to a preferred embodiment thereof, those skilled in the art will appreciate that various changes in form and detail may be made without departing from the intended scope of the present invention as defined in the appended claims.
We claim: 

1. In a computer-implemented speech recognition system that recognizes speech input from a speaker and that includes an audio output device, a method comprising the computer-implemented steps of:

providing a text-to-speech mechanism for creating a spoken version of text;

for a given word of text, using the text-to-speech mechanism to generate a spoken version of the given word;

outputting the spoken version of the given word on the audio output device so that a user of the speech recognition system knows how the speech recognition system expects the given word to be pronounced; and

providing a user interface element for a user to request a different pronunciation of the given word and wherein the spoken version of the given word is output in response to the user requesting the different pronunciation of the given word via the user interface element.

2. The method of claim 1 wherein the user interface 
element is an activatable button that the user activates to 
request a proper pronunciation of a portion of text. 

3. The method of claim 1 wherein the speech recognition system includes a list of words that the speech recognition system recognizes and a mechanism for the user to add words to the list and wherein the using step and the outputting step are triggered by the user adding a new word to the list such that a spoken version of the new word is output.

4. The method of claim 1, further comprising the step of receiving the textual representation of the given word from the user prior to using the text-to-speech mechanism.

5. The method of claim 1, further comprising the steps of:

receiving a designation of a different spoken version of the given word from the user as a proper pronunciation of the given word; and

modifying how the speech recognition system expects the given word to be pronounced to reflect the different spoken versions of the given word designated by the user.

6. The method of claim 1 wherein the speech recognition system is used in a dictation system for converting spoken speech into text.

7. The method of claim 1 wherein the speech recognition system has at least one expected pronunciation for the given word and the spoken version of the given word generated by the text-to-speech mechanism corresponds to the expected pronunciation of the given word.

8. In a computer-implemented dictation system for converting spoken input from a user into text, a method comprising the steps of:

providing a list of pronunciations for words that are recognized by the dictation system;

providing an audible current pronunciation of a selected word stored in the list;

receiving a request from a user to change the current pronunciation of the selected word that is stored in the list to a new pronunciation, said request specifying the new pronunciation; and

changing the pronunciation stored in the list for the selected word from the current pronunciation to the new pronunciation.

9. The method of claim 8, further comprising the step of providing a user interface through which the user makes the request to change the current pronunciation of the selected word.

10. The method of claim 9 wherein the user interface enables a user to spell out the new pronunciation of the selected word with letters.

11. The method of claim 8 wherein the dictation system includes an audio output device and wherein the step of providing an audible current pronunciation of the selected word to the user is performed before receiving the request.

12. The method of claim 8 wherein the dictation system includes an audio output device and wherein receiving the request step further comprises the steps of:

receiving a text string that specifies what the user believes the new pronunciation of the selected word sounds like;

providing a user interface element for a user to hear how the dictation system envisions the text string sounding like;

providing a text-to-speech engine for converting text into speech having an associated pronunciation;

using the text-to-speech engine to output speech for the text string on the audio output device in response to the user using the user interface element; and

generating the request in response to the user accepting the speech generated by the text-to-speech engine for the text string as a proper pronunciation of the selected word.

13. The method of claim 12, further comprising the step of providing an additional user interface element for the user to accept or reject the pronunciation associated with the speech that is output by the text-to-speech engine.
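The flow of claims 8, 12 and 13 (type a sound-alike text string, hear it through text-to-speech, accept or reject it, then update the stored pronunciation) can be sketched in Python as follows. This is a hypothetical illustration only: the function name `request_pronunciation_change`, the callback parameters standing in for the text-to-speech engine and the user interface elements, and the space-separated phoneme notation are assumptions, not details from the patent.

```python
from typing import Callable


def request_pronunciation_change(
    pronunciations: dict[str, str],
    word: str,
    sound_alike_text: str,
    text_to_phonemes: Callable[[str], str],
    play: Callable[[str], None],
    user_accepts: Callable[[], bool],
) -> bool:
    """Let the user propose a new pronunciation for `word` by typing text that
    sounds like it, hear it, and accept or reject it.
    Returns True if the stored pronunciation was changed."""
    candidate = text_to_phonemes(sound_alike_text)  # how the system "envisions" the text
    play(candidate)                                  # audible feedback via TTS
    if not user_accepts():
        return False                                 # rejected: keep the current entry
    pronunciations[word] = candidate                 # accepted: replace the current entry
    return True


# Stub callbacks stand in for a real TTS engine and user interface.
lexicon = {"data": "d ey t ah"}
changed = request_pronunciation_change(
    lexicon,
    "data",
    "dahta",
    text_to_phonemes=lambda text: "d aa t ah",
    play=lambda phones: print("playing:", phones),
    user_accepts=lambda: True,
)
print(changed, lexicon["data"])  # True d aa t ah
```

Passing the engine and interface as callbacks keeps the list-update logic of claim 8 separate from whatever audio and dialog components a real dictation system would supply.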

14. In a computer-implemented speech recognition system that recognizes speech input from a speaker and that includes an audio output device, a method comprising the steps of:

providing a dictionary of terms that the speech recognition system recognizes, said dictionary specifying how the speech recognition system expects each term to be pronounced;

receiving a request from a user to add a new term to the dictionary;

generating a pronunciation for the new term by the speech recognition system;

outputting the pronunciation for the new term on the audio output device so a user can observe and change the pronunciation for the new term; and

adding the new term and the generated pronunciation to the dictionary.

15. The method of claim 14 wherein the speech recognition system includes a text-to-speech engine for converting text into speech and wherein the text-to-speech engine is used to output the pronunciation of the new term.

16. The method of claim 15 wherein the text-to-speech engine uses letter-to-sound rules to generate the pronunciation for the new term.

17. The method of claim 14 wherein the method further comprises the step of prompting the user to verify that the generated pronunciation of the new term is correct.

18. The method of claim 17 wherein when the user verifies that the generated pronunciation of the new term is not correct, receiving a designation of a proper pronunciation for the new term from the user and adding the proper pronunciation to the dictionary.
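Claims 14, 17 and 18 together describe adding a term, playing its generated pronunciation, asking the user to verify it, and storing a user-supplied pronunciation if the generated one is rejected. A minimal Python sketch of that sequence follows; the function name `add_term`, the four callback parameters, and the rule of keeping the generated pronunciation when the user supplies none are hypothetical assumptions made for illustration.

```python
from typing import Callable, Optional


def add_term(
    dictionary: dict[str, str],
    term: str,
    generate: Callable[[str], str],
    play: Callable[[str], None],
    user_confirms: Callable[[str], bool],
    ask_proper_pronunciation: Callable[[str], Optional[str]],
) -> str:
    """Add `term` to the dictionary after letting the user hear and, if
    necessary, correct the generated pronunciation. Returns the stored one."""
    pronunciation = generate(term)  # e.g. letter-to-sound rules (claim 16)
    play(pronunciation)             # audio feedback on the output device
    if not user_confirms(pronunciation):
        proper = ask_proper_pronunciation(term)
        if proper:                  # assumption: keep the generated one if none is given
            pronunciation = proper
    dictionary[term] = pronunciation
    return pronunciation


terms: dict[str, str] = {}
stored = add_term(
    terms,
    "Rozak",
    generate=lambda t: "r ow z ae k",
    play=lambda p: None,
    user_confirms=lambda p: False,
    ask_proper_pronunciation=lambda t: "r aa z ae k",
)
print(stored)  # r aa z ae k
```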

19. In a computer-implemented speech recognition system for recognizing speech spoken from a speaker, said system including an audio output device and a text-to-speech engine for generating speech from text, a method comprising the steps of:

storing multiple pronunciations for a selected word in a dictionary that is used by the text-to-speech engine;

outputting each of the pronunciations on the audio output device so that a user can hear the pronunciations; and

in response to a user selecting one of the pronunciations, using the selected pronunciation by the speech recognition system to recognize speech.
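The audition-and-select step of claim 19 can be sketched as a dictionary entry that holds several pronunciations plus an index marking the one used for recognition. The class name `WordEntry`, its attributes, the `play` callback standing in for the text-to-speech engine and audio output device, and the phoneme strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WordEntry:
    """Hypothetical dictionary entry holding several stored pronunciations."""

    pronunciations: list[str]
    active: int = 0  # index of the pronunciation used for recognition

    def audition(self, play: Callable[[str], None]) -> None:
        """Play every stored pronunciation so the user can compare them."""
        for pron in self.pronunciations:
            play(pron)

    def select(self, index: int) -> str:
        """Make the user's choice the pronunciation used for recognition."""
        if not 0 <= index < len(self.pronunciations):
            raise IndexError(f"no pronunciation #{index}")
        self.active = index
        return self.pronunciations[index]

    @property
    def recognition_pronunciation(self) -> str:
        return self.pronunciations[self.active]


entry = WordEntry(["r uw t", "r aw t"])  # two ways to say "route"
entry.audition(lambda p: print("playing:", p))
entry.select(1)
print(entry.recognition_pronunciation)  # r aw t
```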

20. The method of claim 19 wherein the speech recognition system is used in a dictation system that converts speech spoken by a speaker into text.
21. In a computer-implemented speech recognition system for recognizing speech from a speaker, a method comprising the steps of:

providing a dictionary of terms having pronunciations for the terms that correspond with how the speech recognition system expects the terms to be pronounced;

performing a heuristic to derive alternative pronunciations for the terms;

on multiple instances where the speaker speaks a selected one of the terms such that the speech recognition system recognizes the selected term, determining which of the alternative pronunciations of the selected term the user used; and

based on the determining step, identifying which of the alternative pronunciations of the selected term the user is most likely using and updating the dictionary to designate the pronunciation that the user is most likely using as how the speech recognition system expects the selected term to be pronounced.

22. The method of claim 21 wherein the CART algorithm is applied to derive the alternative pronunciations.

23. In a computer-implemented speech recognition system for recognizing spoken speech from a speaker, said system having an output device, a method comprising the steps of:

receiving a spoken version of a term having a given pronunciation from the speaker;

providing an expected pronunciation for the term that corresponds to how the speech recognition system expects the speaker to speak the term;

comparing the given pronunciation of the spoken version of the term with the expected pronunciation of the term to determine a degree of difference between the given pronunciation of the spoken version of the term and the expected pronunciation of the term; and

where the degree of difference exceeds an acceptable predetermined threshold, generating output on the output device to inform the speaker that the degree of difference exceeds the threshold.

24. The method of claim 23 wherein the speech recognition system is used in a dictation system for generating text from speech.

25. The method of claim 23 wherein the output device is an audio output device and wherein the output is audio output.

26. The method of claim 23 wherein the output device is a video output device and wherein the output is video output.

27. The method of claim 23 wherein the system includes a text-to-speech mechanism and wherein the text-to-speech mechanism generates a spoken version of the term.

28. In a computer-implemented speech recognition system that recognizes speech input from a speaker and that includes an audio output device, a computer-readable medium holding computer-executable instructions for performing a method comprising the computer-implemented steps of:

providing a text-to-speech mechanism for creating a spoken version of text;

for a given word of text, using the text-to-speech mechanism to generate a spoken version of the given word;

outputting the spoken version of the given word on the audio output device so that a user of the speech recognition system knows how the speech recognition system expects the given word to be pronounced; and

providing a user interface element for a user to request a proper pronunciation of the given word and wherein the spoken version of the given word is output in response to the user requesting the proper pronunciation of the given word via the user interface element.

29. The computer-readable medium of claim 28 wherein the user interface element is an activatable button that the user activates to request a proper pronunciation of a portion of text.

30. The computer-readable medium of claim 28 wherein the speech recognition system includes a list of words that the speech recognition system recognizes and a mechanism for the user to add words to the list and wherein the using step and the outputting step are triggered by the user adding a new word to the list such that a spoken version of the new word is output.

31. The computer-readable medium of claim 28 wherein the method further comprises the steps of:

receiving a designation of a different spoken version of the given word from the user as a proper pronunciation of the given word; and

modifying how the speech recognition system expects the given word to be pronounced to reflect the different spoken versions of the given word designated by the user.

32. The computer-readable medium of claim 28 wherein the speech recognition system is used in a dictation system for converting spoken speech into text.

33. The computer-readable medium of claim 28 wherein the speech recognition system has at least one expected pronunciation for the given word and the spoken version of the given word generated by the text-to-speech mechanism corresponds to the expected pronunciation of the given word.

34. In a computer-implemented dictation system for converting spoken input from a user into text, a computer-readable medium holding computer-executable instructions for performing a method comprising the steps of:

providing a list of pronunciations for words that are recognized by the dictation system;

providing an audible current pronunciation of a selected word stored in the list;

receiving a request from a user to change the current pronunciation of the selected word that is stored in the list to a new pronunciation, said request specifying the new pronunciation; and

changing the pronunciation stored in the list for the selected word from the current pronunciation to the new pronunciation.

35. The computer-readable medium of claim 34 wherein the method further comprises the step of providing a user interface through which the user makes the request to change the current pronunciation of the selected word.

36. The computer-readable medium of claim 35 wherein the user interface enables a user to spell out the new pronunciation of the selected word with letters.

37. The computer-readable medium of claim 34 wherein the dictation system includes an audio output device and wherein the method further comprises the step of outputting the current pronunciation of the selected word to the user with the audio output device before receiving the request.

38. The computer-readable medium of claim 34 wherein the dictation system includes an audio output device and wherein the receiving the request step further comprises the steps of:

receiving a text string that specifies what the user believes the new pronunciation of the selected word sounds like;

providing a user interface element for a user to hear how the dictation system envisions the text string sounding like;

providing a text-to-speech engine for converting text into speech having an associated pronunciation;

using the text-to-speech engine to output speech for the text string on the audio output device in response to the user using the user interface element; and

generating the request in response to the user accepting the speech generated by the text-to-speech engine for the text string as a proper pronunciation of the selected word.

39. In a computer-implemented speech recognition system that recognizes speech input from a speaker and that includes an audio output device, a computer-readable medium holding computer-executable instructions for performing a method comprising the steps of:

providing a dictionary of terms that the speech recognition system recognizes, said dictionary specifying how the speech recognition system expects each term to be pronounced;

receiving a request from a user to add a new term to the dictionary;

generating a pronunciation for the new term by the speech recognition system;

outputting the pronunciation for the new term on the audio output device so a user can observe and change the pronunciation for the new term; and

adding the new term and the generated pronunciation to the dictionary.

40. The computer-readable medium of claim 39 wherein the speech recognition system includes letter-to-sound rules for converting textual letters into sounds and wherein the letter-to-sound rules and/or dictionary are used to generate the pronunciation of the new term.

41. The computer-readable medium of claim 39 wherein the method further comprises the step of prompting the user to verify that the generated pronunciation of the new term is correct.

42. The computer-readable medium of claim 41 wherein when the user verifies that the generated pronunciation of the new term is not correct, receiving a designation of a proper pronunciation for the new term from the user and adding the proper pronunciation to the dictionary.

43. In a computer-implemented speech recognition system for recognizing speech spoken from a speaker, said system including an audio output device and a text-to-speech engine for generating speech from text, a computer-readable medium holding computer-executable instructions for performing a method comprising the steps of:

storing multiple pronunciations for a selected word in a dictionary that is used by the text-to-speech engine;

outputting each of the pronunciations on the audio output device so that a user can hear the pronunciations; and

in response to a user selecting one of the pronunciations, using the selected pronunciation by the speech recognition system to recognize speech.

44. The computer-readable medium of claim 43 wherein the speech recognition system is used in a dictation system that converts speech spoken by a speaker into text.

45. In a computer-implemented speech recognition system for recognizing speech from a speaker, a computer-readable medium holding computer-executable instructions for performing a method comprising the steps of:

providing a dictionary of terms having pronunciations for the terms that correspond with how the speech recognition system expects the terms to be pronounced;

deriving alternative pronunciations of the terms by applying a heuristic;

on multiple instances where the speaker speaks a selected one of the terms such that the speech recognition system recognizes the selected term, determining which of the alternative pronunciations of the selected term the user used; and

based on the determining step, identifying which of the alternative pronunciations of the selected term the user is most likely using and updating the dictionary to designate the pronunciation that the user is most likely using as how the speech recognition system expects the selected term to be pronounced.
46. In a computer-implemented speech recognition system for recognizing spoken speech from a speaker, said system having an output device, a computer-readable medium holding computer-executable instructions for performing a method comprising the steps of:

receiving a spoken version of a term having a given pronunciation from the speaker;

providing an expected pronunciation for the term that corresponds to how the speech recognition system expects the speaker to speak the term;

comparing the given pronunciation of the spoken version of the term with the expected pronunciation of the term to determine a degree of difference between the given pronunciation of the spoken version of the term and the expected pronunciation of the term; and

where the degree of difference exceeds an acceptable predetermined threshold, generating output on the output device to inform the speaker that the degree of difference exceeds the threshold.
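The comparison in claims 23 and 46 needs some measure of the degree of difference between the spoken and expected pronunciations, which is not specified here. The sketch below substitutes a phoneme-level Levenshtein edit distance normalized by the length of the expected pronunciation; that metric, the threshold of 0.34, and the phoneme strings are assumptions for illustration only, and a real recognizer might instead compare acoustic scores.

```python
def phoneme_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete pa
                curr[j - 1] + 1,           # insert pb
                prev[j - 1] + (pa != pb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def check_pronunciation(spoken: str, expected: str, threshold: float = 0.34) -> tuple[float, bool]:
    """Return (degree_of_difference, exceeds_threshold), where the degree of
    difference is the edit distance divided by the expected length."""
    s, e = spoken.split(), expected.split()
    degree = phoneme_edit_distance(s, e) / max(len(e), 1)
    return degree, degree > threshold


degree, warn = check_pronunciation("t ow m ey", "t ah m ey t ow")
if warn:
    print(f"Pronunciation differs from the expected one (degree {degree:.2f}).")
```

Normalizing by the expected length keeps one predetermined threshold usable for both short and long terms.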
47. In a computer-implemented speech recognition system for recognizing spoken speech from a speaker, said system having a display device, a method comprising the steps of:

providing an expected pronunciation of a given word that constitutes how the speech recognition system expects the given word to be pronounced by the speaker;

gathering statistics regarding how frequently the given word of spoken speech from the speaker is misrecognized by the speech recognition system; and

where the statistics indicate that the given word is misrecognized more frequently than a threshold value, prompting the user by generating output on the display device through a user interface element such that the user can request a different pronunciation to correct the expected pronunciation of the given word, a spoken version of the given word with a corrected expected pronunciation being output by the user interface element.
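The statistics gathering of claim 47 can be sketched as a per-word counter. In this hypothetical Python illustration, the class name `MisrecognitionMonitor`, the assumption that a user correction marks an instance as misrecognized, the 25% rate threshold, and the minimum sample size are all invented for the example; a `True` result would be the cue to display the user interface element that offers a corrected pronunciation.

```python
from collections import defaultdict


class MisrecognitionMonitor:
    """Hypothetical per-word misrecognition statistics (claim 47 sketch)."""

    def __init__(self, threshold: float = 0.25, min_samples: int = 8) -> None:
        self.threshold = threshold      # assumed maximum acceptable misrecognition rate
        self.min_samples = min_samples  # assumed evidence needed before prompting
        self.spoken = defaultdict(int)
        self.misrecognized = defaultdict(int)

    def record(self, word: str, was_misrecognized: bool) -> bool:
        """Record one spoken instance of `word` (e.g. misrecognized if the user
        corrected it). Returns True when the user should be prompted."""
        self.spoken[word] += 1
        if was_misrecognized:
            self.misrecognized[word] += 1
        return self.should_prompt(word)

    def should_prompt(self, word: str) -> bool:
        """True once the word's misrecognition rate exceeds the threshold."""
        n = self.spoken.get(word, 0)
        if n == 0 or n < self.min_samples:
            return False
        return self.misrecognized.get(word, 0) / n > self.threshold


monitor = MisrecognitionMonitor(min_samples=4)
for miss in (True, False, True, False):
    if monitor.record("Rozak", miss):
        print("Offer to hear and correct the pronunciation of 'Rozak'")
```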

48. A speech recognition system for recognizing speech from a speaker, comprising:

an input device for receiving speech input from the speaker;

a speech recognition engine for recognizing speech in the speech input received from the speaker by the input device wherein the speech recognition engine has expected pronunciations for portions of speech;

a text-to-speech engine for producing a spoken representation of text constituting a selected portion of speech;

an audio output device for outputting the spoken representation of the text from the text-to-speech engine so that the user knows the expected pronunciation of the selected portion of speech; and

an interface component configured to receive a new pronunciation from the user, indicative of a pronunciation more closely conforming to a pronunciation used by the user.
49. The speech recognition system of claim 48 wherein the text-to-speech engine is part of the speech recognition engine.

50. The speech recognition system of claim 48 wherein the speech recognition system further comprises a dictionary holding expected pronunciations of words for use by the text-to-speech engine in producing the spoken representation of text.
51. The speech recognition system of claim 48 wherein the speech recognition system further comprises letter-to-sound rules for converting textual letters into sounds for use by the text-to-speech engine in producing the spoken representation of text.
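Claims 40, 50 and 51 pair a dictionary of expected pronunciations with letter-to-sound rules. One common arrangement, assumed here rather than taken from the patent, is to consult the dictionary first and fall back to the rules for unknown words. The tiny greedy longest-match rule table below is a toy with invented mappings to ARPAbet-style symbols; the names `LTS_RULES`, `letter_to_sound`, and `pronounce` are hypothetical, and real letter-to-sound rules are far larger and context-sensitive.

```python
# Toy grapheme-to-phoneme table (hypothetical; real rules depend on context).
LTS_RULES: dict[str, str] = {
    "tion": "sh ah n",
    "ch": "ch", "sh": "sh", "th": "th", "ph": "f", "ee": "iy", "oo": "uw",
    "a": "ae", "b": "b", "c": "k", "d": "d", "e": "eh", "f": "f", "g": "g",
    "h": "hh", "i": "ih", "j": "jh", "k": "k", "l": "l", "m": "m", "n": "n",
    "o": "aa", "p": "p", "q": "k", "r": "r", "s": "s", "t": "t", "u": "ah",
    "v": "v", "w": "w", "x": "k s", "y": "y", "z": "z",
}
MAX_KEY = max(len(k) for k in LTS_RULES)


def letter_to_sound(word: str) -> str:
    """Convert letters to phonemes by greedy longest-match over LTS_RULES."""
    w = word.lower()
    phones: list[str] = []
    i = 0
    while i < len(w):
        for size in range(min(MAX_KEY, len(w) - i), 0, -1):
            chunk = w[i:i + size]
            if chunk in LTS_RULES:
                phones.append(LTS_RULES[chunk])
                i += size
                break
        else:
            i += 1  # no rule for this character (digit, punctuation): skip it
    return " ".join(phones)


def pronounce(word: str, dictionary: dict[str, str]) -> str:
    """Dictionary lookup first; letter-to-sound rules for unknown words."""
    return dictionary.get(word.lower()) or letter_to_sound(word)


print(pronounce("Tomato", {"tomato": "t ah m ey t ow"}))  # t ah m ey t ow
print(pronounce("chat", {}))                              # ch ae t
```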



UNITED STATES PATENT AND TRADEMARK OFFICE 

CERTIFICATE OF CORRECTION 



PATENT NO. : 5933304 

DATED : August 3, 1999 

INVENTOR(S) : Xuedong D. Huang et al.

It is certified that error appears in the above-identified patent and that said Letters Patent is hereby corrected as shown below:

Column 7, line 19, delete "26" insert --126--
Column 7, line 29, delete "126" insert --128--
Column 7, line 67, delete "136" insert --36--

Claims 

Column 10, line 43, delete "are used"



Signed and Sealed this 
Sixteenth Day of May, 2000 



Attest: 




Q. TODD DICKINSON 



Attesting Officer 



Director of Patents and Trademarks 



